Skip to content

Conversation

@asaberui1
Copy link
Contributor

@asaberui1 asaberui1 commented Nov 30, 2025

Hi!

Thanks for your work! I really appreciate your effort in building a unified benchmark. This PR addresses a bug in the reasoning_gym/games/countdown module where the SymPy symbol replacement logic fails when the number of input numbers (num_terms) is 10 or greater.

🐛 Problem

The current implementation for generating and validating the arithmetic expression breaks down when there are more than 9 unique symbols (representing the input numbers). This causes the expression string to contain incorrect, phantom numbers, leading to an AssertionError during the final answer validation.

The error occurs because the original symbol naming scheme creates ambiguity when the indices transition from single to double digits.

🐞 Root Cause

The original symbol generation used a format like syms = symbols(f"x:{num_terms}"), resulting in symbols like x0, x1, ..., x9, x10, x11, ....

During the subsequent string replacement to substitute the symbol names (x10, x11, etc.) with the actual numbers, the string "10" within "x10" was mistakenly matched and replaced before the full "x10" symbol, causing a collision.

Example Trace:
In an example with 15 numbers:

q="Calculate 4109 using all of these numbers: 288, 258, 234, 214, 292, 221, 283, 249, 214, 251, 251, 216, 261, 289, 253.

Symbols x10 through x14 were incorrectly processed, leading to the unexpected presence of numbers like -2510, -2511, etc. (which are not in the available set) in the final expression string:

a="-2510 - 2511 + 2512 + 2513 - 2514 + 253*(214 + 251 - 251 + 258 - 234 - 249 + 288 - 261) + 283"

This expression is obviously incorrect. The symbol generation is corrected to explicitly include an underscore separator, ensuring that each symbol name is unique and unambiguous during string replacement, regardless of the number of terms.

✅ Fix

The line is updated from:

# Old logic:
# syms = symbols(f"x:{num_terms}")

to

# New logic:
syms = symbols(f"x_{0}:{num_terms}")

The following logic for expr_str generation has been updated accordingly to handle the new x_N format.

With this fix, the expression is generated correctly using only the valid input numbers:

a="-221 - 214 + 216 + 289 - 292 + 253*(214 + 251 - 251 + 258 - 234 - 249 + 288 - 261) + 283"

And the answer is 4109, which is correct.

Updated symbol naming and added safe replacement for expressions.
@asaberui1 asaberui1 marked this pull request as draft November 30, 2025 14:11
Modified return statement to include the modified expression string.
@asaberui1 asaberui1 marked this pull request as ready for review November 30, 2025 14:13
@olliestanley
Copy link
Member

Thanks for submitting this fix. Could you please also add a unit test that covers this scenario?

Add test for CountdownDataset with more than 10 numbers
@asaberui1
Copy link
Contributor Author

I am glad to help adding a unit test. With the test_countdown_more_numbers function, if you run pytest with old versions, it should raise error:

# pytest tests/test_countdown.py 
.......F..                                                                                   [100%]
============================================= FAILURES =============================================
___________________________________ test_countdown_more_numbers ____________________________________

    def test_countdown_more_numbers():
        """Test when min_numbers exceed 10"""
        dataset = CountdownDataset(CountdownConfig(min_numbers=11, max_numbers=11, shuffle=False, size=5, seed=42))  # Set 11 engaged numbers for testing
    
        for item in dataset:
>           assert item["metadata"]["target"] == int(eval(item["metadata"]["expression"]))
E           AssertionError: assert 269 == 496
E            +  where 496 = int(496)
E            +    where 496 = eval('73 - 32 + 320 + 64 + 71 + 84 + 88 - 44 + 33 - 98 - 63')

tests/test_countdown.py:121: AssertionError
===================================== short test summary info ======================================
FAILED tests/test_countdown.py::test_countdown_more_numbers - AssertionError: assert 269 == 496
1 failed, 9 passed in 1.57s

With the newest version, pytest just simply passed:

# pytest tests/test_countdown.py 
..........                                                                                   [100%]
10 passed in 1.69s

@asaberui1
Copy link
Contributor Author

I'm sorry that I didn't do pre-commit checks. I have fixed those errors and now it should work.

# pre-commit run --all-files
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...............................................................Passed
check for added large files..............................................Passed
black....................................................................Passed
isort (python)...........................................................Passed

It works on my computer.

@olliestanley olliestanley merged commit 7d68a6c into open-thought:main Dec 15, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants